As before, I work with the "Alcohol Effects On Study" dataset. In my first work, I investigated the dataset and looked for the best regressors (html version, ipynb version). In the next work, I analyzed explanations of a boosting model with the SHAP method (html version, ipynb version).
This time, I use the same data preprocessing as previously, leaving only nine explanatory variables. Also, I work only with the "maths" dataset.
In this work, I use two models: a boosting model (XGBRegressor) and a Support Vector Machine regressor (SVR). The XGBRegressor parameters are the same as in the previous work. For a fair comparison, I shuffled the dataset with the same random seed and picked the same examples as in the previous work.
For Pupil 0 the XGBRegressor model predicts grade 12.74. I decomposed the prediction using the LIME method implemented in the dalex library with random seed = 0:
Using the lime library directly, it's also possible to get a differently formatted plot of the same results.
The LIME decompositions are fairly stable. I used a few different random seeds and got similar plots. For example, for random seed = 1 we can observe that the most important variables are the same and have a very similar impact. There is just one difference in the order of the variables: for random seed = 1 the positive contribution of low alcohol consumption on weekdays outweighs the negative contribution of family support, while for random seed = 0 it was the other way around.
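The seed dependence comes from the fact that LIME builds its explanation from a randomly perturbed neighbourhood of the observation, sampled via numpy's global random number generator. A minimal sketch of why re-seeding before each call reproduces the explanation — the sampling helper below is a hypothetical stand-in for LIME's internal perturbation step, not its actual code:

```python
import numpy as np

def draw_perturbations(n_samples=5, n_features=9):
    # Stand-in for LIME's internal sampling step: Gaussian perturbations
    # drawn from numpy's *global* RNG (which is what np.random.seed controls)
    return np.random.normal(size=(n_samples, n_features))

np.random.seed(1)
first = draw_perturbations()
np.random.seed(1)
second = draw_perturbations()

# Re-seeding yields the exact same neighbourhood, hence the same
# local surrogate model and the same plot
print(np.array_equal(first, second))  # True
```

This is why the code below calls `random.seed(seed)` and `np.random.seed(seed)` immediately before each `predict_surrogate` call.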
For Pupil 314 the model predicts grade 13.23. In this case, the LIME decompositions are less stable, but they still do not differ much across random seeds. Here is the plot for random seed = 1:
For random seed = 2 the five most important variables make similar contributions to the model's prediction. However, this time the negative impact of high alcohol consumption on weekdays is estimated as higher, and the negative impact of not taking paid maths classes is estimated as lower. Interestingly, this time the pupil's weekend alcohol consumption (which is a bit above the mean) is interpreted as having a negative contribution, whereas previously the contribution was estimated as positive.
As we can see, LIME gives somewhat more detailed explanations than SHAP. Not only does it report each variable's importance, but it also shows the interval the model considers for every variable when deciding how that variable contributes to the prediction.
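Those intervals come from LIME's discretization of numeric features; the contribution itself is a coefficient of a local weighted linear surrogate fitted on perturbed copies of the observation. A minimal from-scratch sketch of that core idea, under illustrative assumptions (a toy black-box function, Gaussian perturbations, an RBF proximity kernel — the names and constants are mine, not LIME's actual internals):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Toy nonlinear "model" to be explained
    return X[:, 0] ** 2 + 3 * X[:, 1]

x0 = np.array([1.0, 2.0])  # the instance to explain

# 1. Perturb the instance to build a local neighbourhood
Z = x0 + rng.normal(scale=0.5, size=(500, 2))

# 2. Weight each perturbed sample by proximity to x0 (RBF kernel)
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.5)

# 3. Fit an interpretable weighted linear surrogate on the neighbourhood
surrogate = Ridge(alpha=1e-3).fit(Z, black_box(Z), sample_weight=weights)

# The coefficients approximate the local behaviour of the black box
# around x0, i.e. roughly its gradient (2, 3) at that point
print(surrogate.coef_)
```

LIME adds one more step on top of this: numeric features are first binned (by default into quartiles), which is exactly where the intervals shown in the plots come from.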
It's also interesting to compare LIME contributions with Shapley values for the surprising example of Pupil 1, which I discussed in the previous work. Similarly to the SHAP values, the contribution of no failures and no absences is considered very large. However, this time we don't see the shockingly large positive impact of high weekend alcohol consumption which we saw for the SHAP method.
I trained a support vector regressor and compared explanations for its predictions with the explanations for the boosting model. Below, I include plots for Pupils 0, 1 and 314.
It can be seen that this model assigns much more importance to the number of past failures. Moreover, the high weekend alcohol consumption of Pupils 0 and 1 has a significant negative impact on the model's predictions. One could say that such a support vector machine model gives results which could be more interesting to the creators of the dataset, who wanted to find the effects of alcohol on pupils' final grades.
!pip install dalex
!pip install lime
import dalex as dx
import xgboost
import lime
import lime.lime_tabular
import sklearn
from sklearn.svm import SVR
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
maths_dataset = pd.read_csv('Maths.csv')
portuguese_dataset = pd.read_csv('Portuguese.csv')
categorical_variables = ['schoolsup', 'famsup', 'paid', 'higher']
numerical_variables = ['studytime', 'failures', 'Dalc', 'Walc', 'absences']
def preprocess_dataset(df):
    # Keep the nine explanatory variables and the target, then shuffle
    new_df = df[['studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'higher',
                 'Dalc', 'Walc', 'absences', 'G3']]\
        .sample(frac=1, random_state=0).reset_index(drop=True)
    new_df.loc[:, new_df.dtypes == 'object'] = new_df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    X, y = new_df.drop(columns='G3'), new_df.G3
    X = pd.get_dummies(X, columns=categorical_variables, drop_first=True)
    return X, y
def fit_boost_model(X, y):
    # Same hyperparameters as in the previous work
    model = xgboost.XGBRegressor(
        n_estimators=500,
        max_depth=3,
        max_leaves=64,
        use_label_encoder=False
    )
    model.fit(X, y)
    return model
def explain_with_dalex(model, X, y, observations=(0, 1, 314)):
    explainer = dx.Explainer(model, X, y)
    for i in observations:
        observation = X.iloc[[i]]
        print(f"Model's prediction for Pupil {i} is {explainer.predict(observation)[0]:.2f}")
        for seed in range(4):
            # LIME samples from numpy's global RNG, so seed before each call
            random.seed(seed)
            np.random.seed(seed)
            explanation = explainer.predict_surrogate(observation)
            print(f"For random seed {seed} the explanation is:\n", explanation.result)
            explanation.plot()
            plt.show()
def explain_with_lime(model, X, y, observations=(0, 1, 314)):
    lime_explainer = lime.lime_tabular.LimeTabularExplainer(
        training_data=X.values,
        feature_names=list(X.columns),
        mode="regression"
    )
    for i in observations:
        observation = X.values[i]
        lime_explanation = lime_explainer.explain_instance(
            data_row=observation,
            # Even though the explainer is defined with correct feature names,
            # calling model.predict yields a feature_names mismatch error;
            # that's why I needed to use validate_features=False
            predict_fn=lambda d: model.predict(d, validate_features=False)
        )
        _ = lime_explanation.as_pyplot_figure()
        _ = lime_explanation.show_in_notebook()
        plt.show()
def explain_svm_model(X, y, observations=(0, 1, 314)):
    svm_ohe = SVR()
    svm_ohe.fit(X, y)
    explainer_svm = dx.Explainer(svm_ohe, X, label="SVM", verbose=False)
    for i in observations:
        observation = X.iloc[[i]]
        explanation_svm = explainer_svm.predict_surrogate(observation)
        explanation_svm.plot(return_figure=True)
        _ = plt.title(f'Explaining SVM predicting {np.round(explainer_svm.predict(observation).item(), 4)} for Pupil {i}')
        plt.show()
maths_X, maths_y = preprocess_dataset(maths_dataset)
maths_model = fit_boost_model(maths_X, maths_y)
explain_with_dalex(maths_model, maths_X, maths_y)
explain_with_lime(maths_model, maths_X, maths_y)
explain_svm_model(maths_X, maths_y)